21 research outputs found

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DSYARTHRIC SPEECH RECOGNITION

    Get PDF
    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech is required, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate more dysarthric speech with a broader range. To evaluate their effectiveness for synthesis of training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decrease WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on the dysarthric ASR systems

    Accurate synthesis of Dysarthric Speech for ASR data augmentation

    Full text link
    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. To evaluate the effectiveness for synthesis of training data for ASR, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decrease WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has significant impact on the dysarthric ASR systems. In addition, we have conducted a subjective evaluation to evaluate the dysarthric-ness and similarity of synthesized speech. Our subjective evaluation shows that the perceived dysartrhic-ness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthriaComment: arXiv admin note: text overlap with arXiv:2201.1157

    Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

    Full text link
    We introduce a bilingual solution to support English as secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments constitute: (a) pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and subsequently bilingual streaming transformer model, (c) a parallel encoder structure with language identification (LID) loss, (d) parallel encoder with an auxiliary loss for monolingual projections. We conclude that in comparison to LID loss, our proposed auxiliary loss is superior in specializing the parallel encoders to respective monolingual locales, and that contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model improves the word error rate (WER) for a code-mix IT task from 46.5% to 13.8%, while also achieving a close parity (9.6%) with the monolingual IT model (9.5%) over IT tests

    Comparison of <i>SINR</i> calculation for conventional MVDR, PSO-MVDR, GSA-MVDR, SLGSA-MVDR [28] and ECGSA-MVDR for user at 0° and interference at 30°.

    No full text
    <p>Comparison of <i>SINR</i> calculation for conventional MVDR, PSO-MVDR, GSA-MVDR, SLGSA-MVDR [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0156749#pone.0156749.ref028" target="_blank">28</a>] and ECGSA-MVDR for user at 0° and interference at 30°.</p

    Comparison of performance of power response with 100 iterations for user at 0°with two interferences at 30° and 50°.

    No full text
    <p>(a) MVDR, (b) PSO-MVDR, (c) GSA-MVDR, (d) SLGSA-MVDR [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0156749#pone.0156749.ref028" target="_blank">28</a>] and (e) ECGAS-MVDR.</p

    An Experience Oriented-Convergence Improved Gravitational Search Algorithm for Minimum Variance Distortionless Response Beamforming Optimum

    No full text
    <div><p>An experience oriented-convergence improved gravitational search algorithm (ECGSA) based on two new modifications, searching through the best experiments and using of a dynamic gravitational damping coefficient (<i>α</i>), is introduced in this paper. ECGSA saves its best fitness function evaluations and uses those as the agents’ positions in searching process. In this way, the optimal found trajectories are retained and the search starts from these trajectories, which allow the algorithm to avoid the local optimums. Also, the agents can move faster in search space to obtain better exploration during the first stage of the searching process and they can converge rapidly to the optimal solution at the final stage of the search process by means of the proposed dynamic gravitational damping coefficient. The performance of ECGSA has been evaluated by applying it to eight standard benchmark functions along with six complicated composite test functions. It is also applied to adaptive beamforming problem as a practical issue to improve the weight vectors computed by minimum variance distortionless response (MVDR) beamforming technique. The results of implementation of the proposed algorithm are compared with some well-known heuristic methods and verified the proposed method in both reaching to optimal solutions and robustness.</p></div

    Mass acceleration toward the result force in GSA [20].

    No full text
    <p>Mass acceleration toward the result force in GSA [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0156749#pone.0156749.ref020" target="_blank">20</a>].</p

    Comparison of weight vectors for conventional MVDR, PSO-MVDR, GSA-MVDR, SLGSA-MVDR [28] and ECGSA-MVDR for user at 0° and interferences at 30°and 50°.

    No full text
    <p>Comparison of weight vectors for conventional MVDR, PSO-MVDR, GSA-MVDR, SLGSA-MVDR [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0156749#pone.0156749.ref028" target="_blank">28</a>] and ECGSA-MVDR for user at 0° and interferences at 30°and 50°.</p
    corecore